Estimating Relevance and Semantic Compatibility for IE Pattern Discovery in Large Text Corpora

Abstract

Pattern-based approaches for Information Extraction (IE) typically apply a pattern learner to a set of domain-specific training documents to generate extraction patterns for the IE system. This restricts the coverage of the system primarily to the expressions and language constructs that appear within the limited training data. Our research looks to the vast quantities of readily available text in resources like the Gigaword Corpus to expand the coverage of existing pattern-based IE systems. The learning strategy exploits two inherent characteristics of extraction patterns – relevance and semantic compatibility – for learning new patterns from unannotated sources. We implement a statistical correlation approach for estimating pattern relevance, and perform semantic compatibility ranking of patterns using two separate approaches – one based on vector similarity, and another based on the semantic classes of extractions. We demonstrate an overall increase in the coverage of an existing IE system using this two-phase strategy for pattern learning.
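The abstract names the two scoring phases but does not give their formulas. As a rough sketch only, assuming pointwise mutual information as the correlation measure for relevance and cosine similarity over bags of extracted head nouns for semantic compatibility (both are illustrative choices, and the names `pmi`, `cosine`, and the toy counts are hypothetical, not taken from the paper), the two scores might be computed like this:

```python
import math
from collections import Counter


def pmi(cooccur, count_a, count_b, total):
    """Pointwise mutual information between two patterns.

    cooccur: documents containing both patterns
    count_a, count_b: documents containing each pattern
    total: documents in the corpus
    (Assumed relevance measure; the paper's actual statistic may differ.)
    """
    if cooccur == 0:
        return float("-inf")
    return math.log2((cooccur * total) / (count_a * count_b))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy data (made up): head nouns extracted by a seed pattern and by a
# candidate pattern found in the unannotated corpus.
seed_extractions = Counter({"embassy": 3, "bus": 1})
candidate_extractions = Counter({"embassy": 1, "hotel": 1})

relevance = pmi(cooccur=10, count_a=20, count_b=20, total=1000)
compatibility = cosine(seed_extractions, candidate_extractions)
print(relevance, compatibility)
```

In a two-phase setup like the one described, candidates would first be filtered by the relevance score and the survivors then ranked by the compatibility score; the thresholds would have to be tuned on held-out data.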

Similar resources

Learning IE patterns: a terminology extraction perspective

The large-scale applicability of knowledge-based information access systems such as the ones based on Information Extraction techniques strongly depends on the possibility of automatically acquiring the large amount of knowledge required. However, the basic assumption of the IE paradigm, i.e. that the information need is known in advance, inherently limits its applicability since the resulting ...

Structural Linguistics and Unsupervised Information Extraction

A precondition for extracting information from large text corpora is discovering the information structures underlying the text. Progress in this direction is being made in the form of unsupervised information extraction (IE). We describe recent work in unsupervised relation extraction and compare its goals to those of grammar discovery for science sublanguages. We consider what this work on gr...

Presentation of an efficient automatic short answer grading model based on combination of pseudo relevance feedback and semantic relatedness measures

Automatic short answer grading (ASAG) is the automated process of assessing answers based on natural language using computation methods and machine learning algorithms. Development of large-scale smart education systems on one hand and the importance of assessment as a key factor in the learning process and its confronted challenges, on the other hand, have significantly increased the need for ...

Leveraging Giant Text Corpora to Enhance the Coverage of Pattern-based Information Extraction Systems

Pattern-based approaches for Information Extraction typically apply a pattern learner to a set of domain-specific documents to generate extraction patterns that comprise the IE system. This limits the coverage of the system to the expressions and language constructs used within the training data. This research exploits the vast quantities of text readily available in large corpora, such as The ...

Publication date: 2008